AI Bootcamp
• Mathematically well-defined; solves reasonably narrow tasks.
• Usually constructs predictive models from data instead of explicitly programming them.
• “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
— Tom Mitchell, Carnegie Mellon University, 1998
• “Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)
Machine Learning is transforming industries and daily life. Some key applications include:
• Search engines (e.g. Google)
• Recommender systems (e.g. Netflix)
• Automatic translation (e.g. Google Translate)
• Speech understanding (e.g. Siri, Alexa)
• Game playing (e.g. AlphaGo)
• Self-driving cars
• Personalized medicine
• Progress in all sciences: Genetics, astronomy, chemistry, neurology, physics, …
Many people are confused about what these terms actually mean.
And what does all this have to do with statistics?
• General term for a very large and rapidly developing field.
• No strict definition, but often used when machines perform tasks that could only be solved by humans or are very difficult and assumed to require “intelligence”.
• Started in the 1940s – when the computer was invented. Turing and von Neumann immediately asked: If we can formalize computation, can we use that to formalize “thinking”?
• Includes ML, NLP, computer vision, robotics, planning, search, intelligent agents, …
• Sometimes misused as a “hype” term for ML or … basic data analysis.
• Or people refer to the fascinating developments in the area of foundation models
• Subfield of AI that investigates methods that allow computers to learn.
• Focus on: How can we let machines learn?
• Statistical learning theory: How can we measure and guarantee that a machine learns?
• Sometimes misused as a “hype” term for … basic data analysis.
• Or people refer to the fascinating developments in the area of foundation models
• Subfield of ML which studies neural networks.
• Artificial neural networks are roughly inspired by the human brain, but we treat them as useful, mathematical models.
• Studied for decades (starting in the 1940s/50s). Uses more layers and specialized neurons (e.g., for images), plus many computational improvements to train on large data.
• Can be used on tabular data, but typical applications are images, texts or signals.
• The last 15-20 years have produced remarkable results imitating human abilities, with outcomes that look intelligent.
“Any sufficiently advanced technology is indistinguishable from magic.”
Arthur C. Clarke’s 3rd law
• Historically developed as different fields, but many methods and concepts are pretty much the same.
• ML: Rather accurate predictions with more complex models.
• Stats: More interpreting relationships and sound inference.
• Now: Both basically work on same problems with same tools.
• Communities are still divided.
• Often different terminology for the same concepts.
• Most parts of ML we could also call: Nonparametric statistics plus efficient numerical optimization.
• Personal opinion: Nowadays few practical differences, seeing differences instead of commonalities mainly holds you back.
• Supervised Learning: learn a model from labeled data (ground truth)
Given a new input X, predict the right output y
Given examples of stars and galaxies, identify new objects in the sky
• Unsupervised Learning: explore the structure of the data (X) to extract meaningful information
Given inputs X, find which ones are special, similar, anomalous, …
• Semi-Supervised Learning: learn a model from (few) labeled and (many) unlabeled examples
Unlabeled examples add information about which new examples are likely to occur
• Reinforcement Learning: develop an agent that improves its performance based on interactions with the environment
• Learn a model from labeled training data, then make predictions
• Supervised: we know the correct/desired outcome (label)
• Subtypes: classification (predict a class) and regression (predict a numeric value)
• Most supervised algorithms that we will see can do both
Supervised learning can be applied to two main types of problems:
• Classification: Where the output is a categorical variable (e.g., spam vs. non-spam emails, yes vs. no).
• Regression: Where the output is a continuous variable (e.g., predicting house prices, stock prices).
Regression is a type of supervised machine learning where algorithms learn from data to predict continuous values such as sales, salary, weight, or temperature. For example: given a dataset containing features of a house (lot size, number of bedrooms, number of baths, neighborhood, etc.) together with its price, a regression algorithm can be trained to learn the relationship between the features and the price.
• Predict a continuous value.
• Target variable is numeric
• Some algorithms can return a confidence interval
• Find the relationship between predictors and the target.
There are many machine learning algorithms that can be used for regression tasks. Some of them are:
• Linear Regression
• Multiple Regression
• Decision Tree
• Random Forest
• Gradient Boosting Regression
Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable and an independent variable. The algorithm finds the best-fit straight line (a linear equation) relating the two variables, which can then be used to predict outcomes for new inputs, making it quite useful for predictive analysis.
• Goal: We want to predict a continuous number (e.g., House Price, Temperature, Stock Value) based on input data.
• Input (\(X\)): Features (Square footage, number of rooms, location)
• Output (\(y\)): The Target (Price)
Notation:
\(m\) = number of training examples
\(x\) = input variable / feature
\(y\) = output/target variable
\((x, y)\) = one training example
\((x^{(i)}, y^{(i)})\) = the \(i\)th training example
Model Representation:
• Hypothesis: \(h(x) = ax+b\), where \(a\) and \(b\) are called parameters
Cost Function:
Goal: Choose \(a\) and \(b\) so that \(h(x) = ax+b\) is close to \(y\) for each training example \((x, y)\)
For each \((x^{(i)}, y^{(i)})\): Minimize \(|h(x^{(i)}) - y^{(i)}|\)
\[\Rightarrow \text{Minimize } \frac{1}{m} \sum_{i=1}^m |h(x^{(i)}) - y^{(i)}|\]
Squared Error Function:
\[J(a, b) = \frac{1}{m} \sum_{i=1}^m \left(h(x^{(i)}) - y^{(i)}\right)^2\] \[\Rightarrow \min_{a, b} J(a, b)\]
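This cost function can be sketched directly in NumPy (the data below is made up for illustration and lies exactly on the line \(y = 2x + 1\)):

```python
import numpy as np

def cost(a, b, x, y):
    """Mean squared error J(a, b) for the hypothesis h(x) = a*x + b."""
    predictions = a * x + b
    return np.mean((predictions - y) ** 2)

# Toy data lying exactly on y = 2x + 1, so the cost at (a=2, b=1) is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(cost(2.0, 1.0, x, y))  # 0.0
print(cost(1.0, 0.0, x, y))  # (4 + 9 + 16) / 3 = 29/3
```

Evaluating the cost at several parameter values like this is a quick sanity check before running gradient descent.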
Given the cost function \(J(a,b)\), our goal is \(\min_{a, b} J(a, b)\)
Algorithm Outline:
- Start with some value of \(a\) and \(b\)
- Keep changing \(a\) and \(b\) to reduce \(J(a, b)\) until hopefully we end up at a minimum
Things to consider:
- Choose the learning rate \(\alpha\)
- Global minimum vs Local minimum
Given a dataset of land prices as illustrated in the table below, find a linear regression model which fits the data. Train the model using the gradient descent algorithm (implemented from scratch).
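A minimal from-scratch gradient descent for this kind of exercise might look as follows (the data here is synthetic, standing in for the land-price table, which is not reproduced in these notes):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Fit h(x) = a*x + b by minimizing J(a, b) = (1/m) * sum((h(x) - y)^2)."""
    a, b = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        errors = a * x + b - y                 # h(x^(i)) - y^(i)
        grad_a = (2 / m) * np.sum(errors * x)  # dJ/da
        grad_b = (2 / m) * np.sum(errors)      # dJ/db
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b

# Synthetic data on the exact line y = 2x + 1:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
a, b = gradient_descent(x, y)
print(a, b)  # close to 2.0 and 1.0
```

Both parameters are updated simultaneously from the same `errors` vector; updating `a` first and then recomputing the errors before updating `b` would be a different (and incorrect) algorithm.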
Notation
- \(n\) = number of features
- \(x^{(i)}\) = input features of the \(i\)th example
- \(x_j^{(i)}\) = value of feature \(j\) of the \(i\)th example
Model Representation:
- Hypothesis: \(h(x) = a x_1 + b x_2 + c\),
  or \(h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\),
  or \(h(x) = \sum_{j=0}^{n} \theta_j x_j\), where \(x_0 = 1\) and \(n = 2\)
- Cost function: \(J(\theta_0, \theta_1, \ldots, \theta_n) = J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h(x^{(i)}) - y^{(i)}\right)^2\)
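The multi-feature hypothesis and cost can be computed in vectorized form. A NumPy sketch with a made-up design matrix (\(x_0 = 1\) occupies the first column):

```python
import numpy as np

# Design matrix with x_0 = 1 prepended (m = 3 examples, n = 2 features).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 5.0]])
y = np.array([10.0, 9.0, 12.0])
theta = np.array([1.0, 2.0, 1.0])  # theta_0, theta_1, theta_2

h = X @ theta             # h(x^(i)) = sum_j theta_j * x_j^(i) for every example at once
J = np.mean((h - y) ** 2) # J(theta) = (1/m) * sum (h(x^(i)) - y^(i))^2
print(h, J)
```

The matrix product replaces the per-example summation, which is both shorter and much faster than looping in Python.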
Make sure all features are on a similar scale
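One common way to achieve this is standardization, rescaling each feature to zero mean and unit variance. A sketch with invented feature values (land area in square meters and distance to city in kilometers, deliberately on very different scales):

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[2000.0, 3.0],
              [1500.0, 10.0],
              [3000.0, 1.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

Returning `mu` and `sigma` matters in practice: new examples (and the test set) must be scaled with the statistics computed on the training data, not their own.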
- The dataset is divided into two parts: training set and test set
- Training set: used to learn patterns and fit the model
- Test set: used to evaluate how well the model generalizes to unseen data
- Prevents overfitting and over-optimistic accuracy
- Analogy: Training = practice questions, Test = real exam
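A minimal from-scratch version of this split (shuffle, then slice; libraries such as scikit-learn provide an equivalent `train_test_split` helper):

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle the examples, then hold out the last test_ratio fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

X = np.arange(20).reshape(10, 2)  # 10 toy examples, 2 features each
y = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_ratio=0.2)
print(len(X_tr), len(X_te))  # 8 2
```

Shuffling before slicing matters: if the data is sorted (e.g., by price), a plain head/tail split would give training and test sets with different distributions.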
Learning Rate
Learning rate is a floating-point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. If it is too high, the model never converges; instead it bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that is neither too high nor too low, so that the model converges within a reasonable number of iterations.
In the first figure (a well-chosen learning rate), the loss curve shows the model significantly improving during the first 20 iterations before beginning to converge.
In contrast, a learning rate that is too small takes too many iterations to converge: in the second figure, the loss curve shows the model making only minor improvements after each iteration.
A learning rate that is too large never converges, because each iteration either causes the loss to bounce around or to increase continually. In the third figure, the loss curve fluctuates wildly, going up and down as the iterations increase; in the fourth figure, the loss first decreases and then drastically increases in later iterations.
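These learning-rate regimes can be reproduced on a tiny one-parameter problem. A sketch minimizing the made-up loss \(J(w) = w^2\) (gradient \(2w\)), so any divergence is due to the step size alone:

```python
def gd_losses(alpha, steps=20):
    """Run gradient descent on J(w) = w^2 and record the loss after each step."""
    w, losses = 5.0, []
    for _ in range(steps):
        w -= alpha * 2 * w  # gradient of w^2 is 2w
        losses.append(w ** 2)
    return losses

print(gd_losses(0.01)[-1])  # too small: still far from 0 after 20 steps
print(gd_losses(0.3)[-1])   # reasonable: essentially 0
print(gd_losses(1.1)[-1])   # too large: the loss grows every step (diverges)
```

Each update multiplies \(w\) by \(1 - 2\alpha\), so \(\alpha > 1\) flips the sign and overshoots further each iteration, which is exactly the "drastically increasing loss" picture.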
Batch Size
Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn’t practical.
Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent:
Stochastic gradient descent (SGD):
Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy. “Noise” refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term “stochastic” indicates that the one example comprising each batch is chosen at random. In the figure, the loss fluctuates slightly as the model updates its weights and bias using SGD, producing noise in the loss graph.
Tip
Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.
Mini-batch stochastic gradient descent (mini-batch SGD):
Mini-batch stochastic gradient descent is a compromise between full-batch gradient descent and SGD: the batch size can be any number greater than 1 and less than the total number of examples in the dataset. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration. The right number of examples per batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.
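A mini-batch SGD training loop for linear regression might be sketched as follows (synthetic data on an exact line; the hyperparameter values are illustrative, not tuned):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.02, batch_size=2, epochs=500, seed=0):
    """Linear regression via mini-batch SGD: shuffle, slice batches, update per batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # new random batch assignment each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)  # averaged batch gradient
            theta -= alpha * grad              # one update per batch
    return theta

# x_0 = 1 column plus one feature; true relationship y = 1 + 2x.
X = np.column_stack([np.ones(6), np.arange(6, dtype=float)])
y = 1.0 + 2.0 * X[:, 1]
theta = minibatch_sgd(X, y)
print(theta)  # close to [1, 2]
```

Setting `batch_size=1` recovers plain SGD and `batch_size=len(X)` recovers full-batch gradient descent, so this one loop covers the whole spectrum described above.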
Tip
When training a model, you might think that noise is an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing.
Epochs
During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch. Training typically requires many epochs. That is, the system needs to process every example in the training set multiple times. The number of epochs is a hyperparameter you set before the model begins training. In many cases, you’ll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model, but also take more time to train.
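The iteration arithmetic above can be checked directly (numbers taken from the example in the text; `ceil` handles a final partial batch):

```python
import math

examples, batch_size = 1000, 100
iterations_per_epoch = math.ceil(examples / batch_size)  # batches needed to see all data once
epochs = 5
total_iterations = iterations_per_epoch * epochs
print(iterations_per_epoch, total_iterations)  # 10 50
```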
Mean Absolute Error (MAE)
- Measures the average absolute difference between predicted and actual values
- Uses the same unit as the target variable
- Less sensitive to outliers than MSE
\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
Mean Squared Error (MSE)
- Squares prediction errors
- Large errors are penalized heavily
- Sensitive to outliers
- MSE is commonly used as a loss function during training because it is differentiable.
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]
Root Mean Squared Error (RMSE)
- Represents the typical size of prediction error
- Same unit as the target variable
- Sensitive to outliers
\[\text{RMSE} = \sqrt{\text{MSE}}\]
R-Squared (\(R^2\))
- Measures the proportion of variance in the target variable explained by the model
- Compares the model against a baseline that predicts the mean
- Values typically range from 0 to 1 (higher is better); can be negative for a model worse than predicting the mean
- Does NOT measure prediction error
\[
R^2 = 1 -
\frac{
\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
}{
\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
}
\]
Adjusted R-Squared
- Adjusted version of \(R^2\) that accounts for the number of predictors
- Penalizes adding irrelevant features
- Essential for Multiple Linear Regression
- Increases only when a new feature improves the model
\[
\text{Adjusted } R^2 =
1 - \left(
\frac{(1 - R^2)(n - 1)}{n - p - 1}
\right)
\]
Where: \(n\) = number of samples, \(p\) = number of predictors
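All five metrics can be computed from scratch with NumPy. A sketch using small made-up predictions:

```python
import numpy as np

def regression_metrics(y_true, y_pred, p):
    """Return MAE, MSE, RMSE, R^2, and adjusted R^2 (p = number of predictors)."""
    n = len(y_true)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                        # average absolute error
    mse = np.mean(errors ** 2)                           # average squared error
    rmse = np.sqrt(mse)                                  # back in the target's units
    ss_res = np.sum(errors ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # baseline: predicting the mean
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # penalize extra predictors
    return mae, mse, rmse, r2, adj_r2

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
print(regression_metrics(y_true, y_pred, p=1))
```

Note how \(R^2\) is built from the same residuals as MSE but normalized against the mean-prediction baseline, which is why it measures explained variance rather than error size.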
Given a dataset of land price:
1. Build a linear regression model which predicts the land price using both the land_area and the distance_to_city features. (See the dataset in 'land_price_1.csv')
2. Using only the distance feature, build a model with hypothesis \(h(x) = \theta_{0}+\theta_{1}x+\theta_{2}\sqrt{x}\) to predict the land price. (See the dataset in 'land_price_2.csv')
- Predicts a class label (category), which is discrete and unordered
- Can be binary (e.g., spam / not spam) or multi-class (e.g., letter recognition)
- Many classifiers can return a confidence score per class
- Model predictions create a decision boundary separating the classes
There are many machine learning algorithms that can be used for classification tasks. Some of them are:
• Logistic Regression
• Decision Tree Classifier
• Random Forest Classifier
• Support Vector Machine (SVM)
• K-Nearest Neighbors (KNN)
• Naive Bayes Classifier
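As a concrete illustration, here is a minimal from-scratch version of one of these algorithms, K-Nearest Neighbors, on toy clustered data (a sketch, not a production implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest points
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]              # majority vote

# Two well-separated toy clusters labeled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.1])))   # 1
```

The implicit decision boundary here is the set of points equidistant (in vote terms) between the clusters, which connects back to the decision-boundary idea above.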
Exercise 2
Instinct Institute
Mork Mongkul | Introduction to Machine Learning | AI Bootcamp